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Abstract 

We introduce a dimension reduction method for visualizing the clustering structure ob¬ 
tained from a finite mixture of Gaussian densities. Information on the dimension reduction 
subspace is obtained from the variation on group means and, depending on the estimated 
mixture model, on the variation on group covariances. The proposed method aims at reduc¬ 
ing the dimensionality by identifying a set of linear combinations, ordered by importance as 
quantified by the associated eigenvalues, of the original features which capture most of the 
cluster structure contained in the data. Observations may then be projected onto such a 
reduced subspace, thus providing summary plots which help to visualize the clustering struc¬ 
ture. These plots can be particularly appealing in the case of high-dimensional data and 
noisy structure. The new constructed variables capture most of the clustering information 
available in the data, and they can be further reduced to improve clustering performance. 
We illustrate the approach on both simulated and real data sets. 

Keywords: Dimension reduction, Model-based clustering, Gaussian mixture. Sliced inverse 
regression. Feature selection. 


1 Introduction 


Model-based clustering assumes that the observed data are generated from a mixture of G com- 


lan and Peel 
where the tt, 


ponents, each representing the probability distribution for a different group or cluster (McLach-| 
2000). The general form of a finite mixture model is f{x) = Ylg=i'^gf 9 i^\^g): 
g s are the mixing probabilities such that % > 0 and X] % = 1) fgi') 6g 
are, respectively, the density and the parameters of the ^-th component {g = 1,..., G). With 
continuous data, we often take the density for each mixture component to be the multivariate 
Gaussian (^(a:|/Xg, Xlg) with parameters 9g = {fig,'Sg). Thus, clusters are ellipsoidal, centered 
at the means fj,g, and with other geometric features, such as volume, shape and orientation, 


determined by S 


g- 


Parsimonious parametrization of covariance matrices can be adopted through eigenvalue 


decomposition in the form = XgDgAgD~^, where Xg is a scalar controlling the volume of 
the ellipsoid, Ag is a diagonal matrix specifying the shape of the density contours, 
an orthogonal matrix which determines the orientation of the corresponding ellipsoid (Banfield 


and Dg is 


and Raftery, 1993 Celeux and Govaert, 1995). Fraley and Raftery (2006 Table 1) report some 


parametrizations of within-group covariance matrices available in the MCLUST software, and the 
corresponding geometric characteristics. Maximum likelihood estimates for this type of mixture 


models can be computed via the EM algorithm (Dempster et ah 1977; Fraley and Raftery, 2002), 


while model selection could be based on the Bayesian information criterion (BIG) (Fraley and 


Raftery, 1998) or the integrated complete-data likelihood (ICL) criterion (Biernacki et ah 2000). 


In this paper we propose a dimension reduction method for model-based clustering, where 
we pursue the estimation of a reduced subspace which captures most of the clustering structure 
contained in the data. Following the work of ^ (1991 2000) on Sliced Inverse Regression (SIR), 
information on the dimension reduction subspace is obtained from the variation on group means 


1 







































and, depending on the estimated mixture model, on the variation on group covariances. The 
estimated directions, ordered by importance as quantified by the associated eigenvalues, span 
a dimension reduction subspace. Observations may then be plotted on such a subspace, hence 
providing summary plots which visualize the clustering structure, and may help to understand 
the clustering and improve accuracy. 

In Section we introduce the problem of clustering on a dimension reduced subspace, and 
we show that assignment of an observation to a given cluster is unchanged if performed on 
a suitable subspace. Then we discuss estimation of the directions which span the reduced 
subspace, and we provide some of their properties. In Section we review and adapt a greedy 
search procedure for selecting a subset of the new constructed variables which retains most 
of the clustering information contained in the data. In Section we describe a data analysis 
example based on a synthetic data set; then we discuss results from simulation studies where 
we compare the proposed approach to model-based clustering performed on both the original 
variables and the leading principal components. Sectionpresents applications of the proposed 
procedure on some real data sets. The final section contains some concluding remarks. 


2 Dimension reduction for model-based clustering 


Classical procedures for dimensionality reduction are principal components analysis and factor 
analysis, both of which reduce dimensionality by forming linear combinations of the features. 
The first method seeks a lower-dimensional representation that account for the most variance 
of the features, while the second looks for the most correlation among the features. However, 
neither method correctly addresses the problem of visualizing any potential clustering structure. 


Chang (1983) discussed a simulated example from a 15-dimensional mixture model to show 


the failure of principal components as a method for reducing the dimension of the data before 
clustering. Let X = 0.5xd+dxY+Z, where d = 0.95—0.05f (z = 1,..., 15), T ~ Bernoulh(0.2), 
Z ~ N(//, H) with mean /.tisxi = [0,..., 0]"'" and covariance matrix Xlisxis = era = 1, o'ij = 
—0.13/i/j, where the hrst 8 elements of / are —0.9 and the last 7 are 0.5. With this scheme the 
first 8 variables can be considered roughly as a block of variables with the same correlations, 
while the rest of the variables form another block. Using this scheme we simulated n = 300 data 
points. Figureshows the scatterplot matrix of the first, second and last principal components, 
and the first GMMDR variable (to be discussed on Section 2.2) obtained from a two components 
FEE mixture model. As it can be seen, the first two PCs are not able to show any clustering 


information, but, as discussed by Chang (1983), the hrst and the last PCs are needed. On the 


contrary, only one GMMDR variable is required to clearly separate the clusters. Furthermore, 
the coefficients (under unit norm constraint) of the linear combination dehning such a direction 
reproduce the blocking structure used to simulate the variables, with the hrst 8 variables having 
coefficients approximately equal to —0.32 and the remaining 7 coefficients approximately equal 
to 0.16. 


Tipping and Bishop (1999b) proposed a probabilistic principal component analysis based on 


a factor analysis model for the data assuming that the distribution of the errors are isotropic, 
i.e. the covariance matrix is proportional to the identity matrix. They also developed a mixture 
of probabilistic principal components analyzers model ( [Tipping and Bishop 
archical visualization algorithm for exploring clusters ([Bishop and Tipping 


and a hier- 
McLachlan 


and Peel (2000) and McLachlan et al (2003) used mixtures of factor analyzers, by postulating 



that the distribution of the data within any group or latent class follows a factor analysis model. 
Parsimonious Gaussian mixture models based on mixtures of factor analyzers were discussed 


by McNicholas and Murphy (2008). In order to deal with high dimensional data, Bouveyron 


et al (2007) proposed a family of parsimonious GMMs on clustering subspaces. All such models 


mainly address the problem of reducing the number of parameters to be estimated by a suitable 
reparametrization of the covariance matrix. Since they produce estimates of (latent) factor an- 
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Figure 1: Scatterplot matrix of 1st, 2nd and 15th PC, and 1st GMMDR variable for the Chang 
data with points marked according to cluster membership. 


alyzers, they can also be used to reduce dimensionality. However, as noted by McLachlan and 
Peel (2000, p. 248), the resulting plots are not always useful in showing clustering structure 


To support such claim, they presented an example using a simple two clusters synthetic data 
set where the differences between the group means are only in one variable, and with within- 
group variation in the other variables being relatively large. They showed that both principal 
components and factor analyzers were unable to show the underlying clustering structure (see 
McLachlan and Peel, 2000, p. 240, 254-255). We applied the GMMDR method proposed in 


this paper, and the data projected onto the subspace spanned by the first two directions are 
shown in Figure the two clusters appear well separated with one group showing less variation 
than the other. 


2.1 Clustering on a dimension reduced subspace 

Consider the (p x 1) vector X of random variables, and the discrete random variable Y taking 
G distinct values to indicate the G clusters. Let f3 denote a fixed {p x d) matrix, where d < p, 
such that Y^LX\|3^X. This conditional independence statement says that the distribution of 
Y\X is the same as that of Y\I3^X for all values of X in its marginal sample space. As a 
consequence, the (p x 1) vector X can be replaced by the {d x 1) vector (3^X without loss 
of clustering information. If d < p, we have effectively reduced the dimension of the predictor 
vector. The matrix /3 provides a basis for the subspace S{f3), which in a regression context is 
called a dimension-reduction subspace (DRS) for the regression of T on W (Li, 1991). It can 
be shown that a minimum DRS may not be unique, but when several of such subspaces exist 
they all have the same dimension. Le t denote the intersection of all DRSs. Under various 

reasonable conditions given in Cook (1998 Chap. 6), 5y|x is a unique minimum DRS and it is 
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Figure 2: Plot of McLachlan and Peel (2000, p. 239-240) synthetic data projected along the 
first two GMMDR directions, with points marked according to true group origin. The curve 
represents clustering boundary. 


called central DRS. 

The assumption Y^LX\|3^X implies that Pr(y = g\X) = Pr(y 
write the density for the g'-th component of the mixture model as 


fg{x) 


Pr(y = g\x)fx{x) 
Pr(y = g) 

Pr(y = g\(3^x)fx{x) 
Pr(y = g) 


fgiP^x 


fx{x) 




and for any two groups, g and k, the ratio of the densities is 

fgjx) ^ fgiP^x) 

fk{x) fkif^^x) 


glP'^x) , SO we may 


( 1 ) 


Thus, if YALX\(3^X the ratio of the conditional densities is the same whether it is computed 
on the original variables space or on the DRS, so the clustering information is completely 
contained in S{(3). For instance, consider the simple case of two clusters with bivariate Gaussian 
distribution x\{Y = j) ~ N{fj,j,l 2 ) for j = 1,2, where fii = [—IjO]"*", ^2 is 

the (2 X 2) identity matrix. The ratio of the densities is thus given by 


= exp{-^(a; - {x - + ^(a; - fi 2 V{x - = exp{-2xi}. 

h{x) 2 2 

It is straightforward to see that the direction containing all the clustering information is /3 = 
[1,0]^, so the ratio of the densities projected along such direction is 

= exp{-|(/3^® - 

f 2 {P x) 2 2 

= exp{-^(xi + 1)^ + - 1)2} = exp{-2xi}. 
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In model-based clustering an observation is assigned to cluster g by the maximum a posteriori 
principle (MAP), i.e. to the cluster for which the conditional probability given the data is a 
maximum: 

arggmaxPr(y = g\x) = 

By equation the last expression is equivalent to 

arggmaxPr(y = glP^ x) = 


TTgfgiP'^x) 

Ef=i ’ 




so the assignment of an observation to a cluster is unchanged if performed on the DRS. Cook 


and Yin (2001, p. 156) considered D{Y\X) = arggmaxPr(y = g\x) and defined a discriminant 
subspace as a subspace S such that D(Y\X) = D{Y\PsX) for all values of X, where Ps denotes 
the projection operator onto S with respect to the usual inner product. The intersection of all 
discriminant subspaces, if it exists, is itself a discriminant subspace, and is called the central 
discriminant subspace (CDS) and denoted by 5£)(y|x)- Finally, the CDS is a subset of the 
central DRS, 5 d(y|x) ^ ‘5y|x, and in some cases may be a proper subset. 

In the case of known group memberships, a variety of methods for dimensionality reduction 
have been proposed (Cook and Yin, 2001). These methods, such as Sliced Inverse Regression 


(SIR) and Sliced Average Variance Estimation (SAVE), are not restricted to any particular 
classification method. In the following subsection we discuss estimation of a dimension reduction 
subspace for a model-based clustering procedure. 


2.2 Estimation of GMMDR directions 


Suppose we describe a set of n observations on p variables through a G components Gaussian 


mixture model (GMM) of the form f{x) = Yl^=i ^g)- 

From a graphical point of view, we pursue the smallest subspace that can capture the 
clustering information contained in the data. A natural starting point would require to look for 
those directions where the cluster means pig vary as much as possible, provided each direction 
is orthogonal to the others. This amounts to solving the following optimization problem: 


arg^ subject to (3 ' El/3 = I, 


T^ 


where is the between-cluster covariance matrix, El = n ^ 

{xi — pi){xi — piY' is the covariance matrix with pi = P ^ is the spanning 

matrix, and Id is the {d x d) identity matrix. The solution to this constrained optmization 
is given by the eigendecomposition of the kernel matrix Mj = 'Sb with respect to El. The 
eigenvectors, corresponding to the first d largest eigenvalues, [ui,... ,Vd] = /3, provide a basis 
for the subspace S{f3) which shows the maximal variation among cluster means. There are at 
most d = min(p, G — 1) directions which span this subspace. 


This procedure is similar to the sliced inverse regression (SIR) algorithm (Li 1991). SIR is 


a dimension reduction method which exploits the information from the inverse mean function 
(see Li, 2000, Chap. 2). Instead of conditioning on the response variable (or a sliced version of 
it if continuous), here we condition on the cluster memberships. SIR directions span at least 


part of the dimension reduction subspace (Cook, 1998, Chap. 6) and they provide an intuitive 


and useful basis for constructing summary plots. However, they may miss relevant structures 
in the data when within-cluster covariances are different. 


The SIR// method (Li, 2000, Chap. 5) exploits the information coming from the differences 
in the class covariance matrices. The kernel matrix is now defined as: 


G 

Mil = ^7rg(Elg - El) 
g=i 


51-'(S,-E) 
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where "S = is the pooled within-cluster covariance matrix, and directions are found 

through the eigendecomposition of Mjj with respect to S. Albeit the corresponding directions 
show differences in group covariances, they are usually not able to also show location differences. 

The proposed approach aims at finding those directions which, depending on the selected 
Gaussian mixture model, are able to display both variation in cluster means and variation in 
cluster covariances. 


Definition 1. Consider the following kernel matrix 

M = + Mu. (2) 

The basis of the dimension reduction subspace 5(/3) is the solution of the following constrained 
optimization: 

aig^ max Mf3, subject to P~^1ip = Id- 

This is solved through the generalized eigendecomposition 

Mvi = li'Svi, vj Suj = 1 if i = j, and 0 otherwise, 
h ^ h ^ ^ > 0. 


Thus, the basis of the required subspace is given by the GMMDR directions /3 = [ui,..., u^]. 


The kernel matrix Q contains information from variation on both cluster means and cluster 
covariances. For mixture models which assume constant within-cluster covariance matrices, such 
as models E, Eli, EEI, and EEE (see Eraley and Raftery, 2006, Table 1), the subspace spanned 
by M is equivalent to that spanned by Mi. 


Remark 1. Let S{(3) be the subspace spanned by the {p x d) matrix (3 of eigenvectors from 
Q, and fig and be the means and covariance, respectively, for the g-th cluster. It can be 
easily seen that the projection of the parameters onto the subspace S{/3) are given by fig and 
P'~^'Sg(3. Eurthermore, the projection of the (re x p) data matrix X onto S{(3) can be computed 
as Z = XP; we call these GMMDR variables. 


Remark 2. The raw coefficients of the estimated directions are uniquely determined up to 
multiplication by a scalar, whereas the associated directions from the origin are unique. Hence, 
we can adjust their length such that they have unit length, i.e. each direction is normalized as 
Pj = i>j/||ujj| for j = I, ... ,d. In matrix form, let D = diag(V'''V), where V is the matrix of 
eigenvectors from Q, then P = VD Eor the GMMDR variables 


Cov(Z) = /3^S/3 = ^ £,-1 ^ diag(l/||uj||2). 


Thus, they are uncorrelated, and with variances inversely related to the squared length of the 
eigenvectors of the kernel matrix, while GMMDR directions are orthogonal with respect to 
S-inner product. 

We now provide some properties as propositions whose proofs are reported in the Appendix 

EH 


Proposition 1. GMMDR directions are invariant under affine transformation x i—?■ Cx + a, 
for C a nonsingular matrix and a a vector of real values. Thus, GMMDR directions in the 
transformed scale are given by C~^p. 

Proposition 2. Each eigenvalue of the eigendecomposition in (© can he decomposed in the sum 
of the contributions given by the squared variance of the between group means and the average of 
the squared within group variances along the corresponding direction of the projection subspace, 

i.e. 

k = Yax{E{Zi\Y)f + E{\ai{Zi\Yf) for i = 1,..., d. 
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Based on this result we may provide an interpretation to the contribution of each direction 
to the visualization of the clustering structure. Furthermore, directions corresponding to small 
eigenvalues provide little or no information about differences in cluster means or covariances. 

Remark 3. Let X be the (n x p) sample data matrix which we assume, with no loss of gener¬ 
ality, to have zero mean column-vectors. The sample version M of ([^ is obtained using the 
corresponding estimates from the fit of a Gaussian mixture model via the EM algorithm. Then, 
GMMDR directions are calculated from the generalized eigen-decomposition of M with respect 
to S. 

Remark 4. The dimension of the estimated subspace S{f3) is d = min{p, G— 1} for models which 
assume equal within-cluster covariance matrices (i.e. models E, Eli, EEI, and EEE), while d < p 
otherwise. The GMMDR directions are ordered based on the associated eigenvalues, so those 
directions corresponding to approximately zero eigenvalues can be discarded in practice since 
clusters largely overlap along such directions. A more formal approach to the selection of 
relevant directions is discussed in Section [3l 


3 Subset selection of GMMDR variables 


In the previous section we have discussed the estimation of GMMDR variables, which is basically 
a linear mapping method onto a suitable subspace. This can be viewed as a form of (soft) 
feature extraction, where the components are reduced through a set of linear combinations of 
the original variables. However, it may be possible that a subset of the estimated GMMDR 
variables provide either no clustering information or redundant clustering information already 
provided by other variables. 

Consider the partitioned (p x p) matrix (/3,/3q), which forms a basis for M^. Based on 
proposition 3 of Cook and Yin (2000, p.l57), if Pq XX\Y and P^XALY then S{P) is 
a DRS. This is useful since, albeit it does not indicate how to construct /3, it indicates that 
unnecessary variables P^X are independent from relevant variables P~^X within groups, and 
that they are also marginally independent from Y. In our case, given that unnecessary GMMDR 
variables provide no clustering information but require parameters estimation, we would like to 
detect and discard them. 

In the following we consider a subset selection method for GMMDR variables following the 
approach of Raftery and Dean (2006), who proposed the use of BIG to approximate Bayes 
factors for comparing mixture models fitted on nested subsets of variables. A generalization of 
this approach has been recently discussed by Maugis et al (2009). 


3.1 A criterion for feature selection 


Let AY be the set of indices with dim(=5^) = q indexing a subset of q features from the original d 
GMMDR variables Z {q < d), and = {AY \i} <Z AY he the set of <Ymi{AY') = q — 1 obtained 
excluding the i-th feature index from AY . The comparison of the two nested subsets can be 
recast as a model comparison problem and addressed using BIG difference as an approximation 
to Bayes factor. Eollowing Raftery and Dean (2006), the BIG difference for the inclusion of the 
i-ih. feature is given by 


BICdis{Zi^y) = BICc\nst{Zy) — RlCnotciust(-^.x’), (4) 

where BICc\nst{Zy) is the BIG value for the “best” model fitted using features in y, and 
BICaotc\vLst{Zy) is the BIG value for no clustering. The latter can be written as 

BICYLotchistiZy) = BlCclustiZy) + BICreg{Zi\Zyi), 

i.e. the BIG value for the “best” model fitted using features in plus the BIG value for the 
regression of the i-th feature on the remaining {q — I) features in AY' . Noting that GMMDR 
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(see [Raftery and Dean 


2006 


eq. (7)) 


variables are orthogonal, the general formula for BlC^eg 
reduces to ^'7 .\7 ^ 

likelihood estimate of the z-th feature variance. In all cases the “best” clustering model is 
identified with respect to both the number of mixture components and model parametrization. 


regiZi\Zyi) = —nlog(27r)—nlog(?j^)—n—(g+1) log(n), where is the maximum 


3.2 A greedy search algorithm for selecting the “best” subset 

GMMDR variables are defined as orthogonal linear combinations of the original variables, and 
it is likely that only a subset of them is needed for clustering. However, the space of all 
possible subsets of size q, with q ranging from 1 to d, has number of elements equal to 2'^ — 1. 
An exhaustive search soon becomes unfeasible, even for moderate values of d. To alleviate 


this problem, Raftery and Dean (2006) proposed a greedy search algorithm for finding a local 


optimum in the model space. The search has both a forward step which evaluates inclusion 
of a proposed variable, and a backward step which evaluates exclusion of one of the currently 
included variables. The algorithm stops when consecutive inclusion and exclusion steps are 
rejected. Since GMMDR variables are orthogonal the search can be simplified by skipping the 
backward step. This allows us to considerably speed up the computing time. 

The search starts by selecting the single GMMDR variable which maximizes the BIG cri¬ 
terion. This is equivalent to using the statistic in equation (|^, with y’' = 0 and =5^ = {z} 
for any z = 1,..., d. Thus, at the first step equation @ measures the difference between the 
best clustering model and the single cluster model for each variable. In the following steps, we 
select the GMMDR variate which maximizes the BIG difference (Q, so taking into account the 
features already included. We keep adding GMMDR variates until the BIG difference becomes 
negative. A detailed description of the greedy search algorithm is reported in Appendix |A.2[ 

At each step, the search over the model space is usually performed with respect to both 
the model parametrization and the number of clusters. However, we may want to fix the 
model parameters at the estimated values obtained using the full set of features; in such a case, 
the order in which the GMMDR directions are selected follows the ordering of the associated 
eigenvalues. 


3.3 An iterative procedure for feature selection 

The greedy search discussed in the previous section is likely to discard some of the GMMDR 
variates, in particular those associated with small eigenvalues which do not carry any clustering 
information once other features have been included. A GMM may then be fitted on such con¬ 
structed variables and the corresponding GMMDR directions estimated. The feature selection 
step can then be repeated, and the whole process iterated until no directions could be dropped. 
Depending on the number of initial features, the number of iterations required is usually small. 
The whole process is described in the following algorithm: 

Algorithm: GMMDR estimation and feature selection process 


1 . 

2 . 

3. 

4. 

5. 


Fit a GMM on the original data. 


Estimate GMMDR directions using the method in Section 2.2 and project the 
data onto the estimated subspace. 


Perform feature selection using the greedy search discussed in Section [3^ 
Fit a GMM on the selected GMMDR variables and go to step 2. 

Repeat steps 2-4 until none of the features could be dropped. 


8 



















4 Simulations 


In this section we present a data anaiysis exampie based on a synthetic data set to describe 
the proposed approach, foiiowed by some simuiation studies. The adjusted Rand index (ARI), 
proposed by Hubert and Arabie (1985), is adopted for evaiuating the ciustering obtained and 
comparing the resuits arising from competing mixture modeis. The ARI gives a measure of 
agreement between two partitions, one estimated by a statistical procedure independent of the 
labelling of the groups, and one being the true classification. This index has zero expected value 
in the case of random partition, and it is bounded above by 1. Thus, higher values represent a 
better performance. 


4.1 Synthetic data on three overlapping clnsters with different shapes 

Consider a simulated data set with three overlapping clusters on ten variables, with each cluster 
sample size equal to = 50 {g = 1, 2, 3). The hrst three variables have been generated from a 
multivariate normal distribution with means fii = [ 0 , 0 , 0 ]"'', //2 = [4, — 2 , 6 ]"'', P 3 = [— 2 , —4, 2 ]"'', 
and within-groups covariances 
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Seven noise variables, generated independently from a standard normal distribution, were also 
added. Thus, clustering information is only contained in the first three variables, with the 
remaining ones been simply noise. The correct model we hope to find is the VVV model with 
G = 3. 

The GMM with the largest BIC is the seven clusters EEI model, closely followed by the 
model with six clusters. Both models have the wrong number of components, thus yielding a 
poor classification accuracy. Constraining G = 3 the selected model improves the ARI, albeit 
there are still some misclassihed data points (see Table [^. Using the “best” unconstrained 
GMM we obtained the GMMDR directions; there are d = 6 of such directions, but only the 
first three are retained by the feature selection step. As it can be seen from Table the hnal 
VVV mixture model with 3 clusters fitted on the three selected GMMDR variables is able to 
recover the original clustering structure. 


Table 1: Clustering results for the three clusters simulated data. 


Method 

Model 

G 

BIC 

ARI 

GMM (unconstrained) 

EEI 

7 

-4731.56 

0.6674 

GMM (constrained) 

VII 

3 

-4820.57 

0.8671 

GMMDR (3 directions) 

VVV 

3 

-1339.40 

1.0000 


The left panel of Figure shows the eigenvalues associated with the GMMDR directions. 
From such a plot we can see that the first two directions are the most important, with the first 
showing both difference in means and variances among clusters, while the second almost only 
differences in means. The last direction adds a small amount of clustering information, mostly 
from differences in variances. The coefficients for the selected GMMDR directions (see the right 
panel of Figure indicate that only the first three variables seem to be required, since the 
coefficients for the remaining ones are almost zero. 

Graphs contained in Figure use GMMDR directions to convey clustering information 
about the fitted GMM. Plots along the diagonal show univariate component densities, while 
plots above the diagonal report contours of the estimated densities for each component. Such 
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Figure 3: Plot of eigenvalues for each estimated direction with contributions from means and 
variances differences among clusters (left panel) and coefficients defining the GMMDR directions 
(right panel). 


graphs clearly reflect the characteristics of the mixture model fitted; in particular, clusters have 
different volume, shape and orientation. Furthermore, it is quite clear that clusters on the 
projection subspace appear to be well separated along the first two directions, as it can be seen 
from the maximum a posteriori (MAP) classihcation areas below the diagonal of Figure]^ 

4.2 Simulation studies 

Here we discuss the results from some simulation studies we conducted to compare the proposed 
GMMDR approach to model-based clustering performed on both the original variables and the 
leading principal components. The data sets used for such comparison are generated from some 
overlapping Gaussian mixtures with different covariance matrix structures. In addition, we 
investigated the effect of the inclusion of a certain number of redundant and noise variables. 
The former are variables correlated with those bringing clustering information, so they only 
provide redundant information. The latter are variables whose distribution do not depend on 
the cluster label, thus they only introduce some noise which tends to obscure the underlying 
clustering structure to be recovered. 

The simulation schemes used for generating the data are described below. 

Model 1: three overlapping clusters with common covariances. In the first simulation 
scheme each data set was generated from three overlapping clusters with equal mixing proba¬ 
bilities on three variables. These were generated from a multivariate normal distribution with 
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Figure 4: Scatter plots of GMMDR variables for the three clusters simulated data: bivariate 
component density contours (above diagonal), marginal univariate component densities (diago¬ 
nal), and MAP regions (below diagonal). 


means = [ 0 , 0 , 0 ]"'^, ^2 — [ 0 > 2 , 2 ]"'", /X 3 = [ 2 , — 2 , — 2 ]"'^, and common covariance matrix 


S = 


2.0 0.7 0.8 
0.7 0.5 0.3 
0.8 0.3 1.0 


To this basic setup, which correponds to a FEE mixture model (see 
Table 1), we added two further scenarios. In the first scenario [noise variables), seven noise 
variables were generated from independent standard normal variables. In the second scenario 
[redundant and noise variables), we generated three variables correlated with each clustering 
variable (with correlation coefficients equal to 0.9, 0.7, 0.5, respectively) and four independent 
standard normal variables. In both cases, there was a total of ten variables. 


Fraley and Raftery 2006 


Model 2: three overlapping clusters with constant shapes. In this simulation scheme 
we generated observations with equal prior from three classes using the mixture model VEV. 
The means for each class were = [0,0,0]"*", ^2 = [4,—2,6]"'^, and = [—2,—4, 2]"'^, while 
for the covariance matrices = XgDgAD^ [g = 1,2,3) the scale, shape and orientation 
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parameters were, respectively, A = [0.2,0.5,0.8]"'^, A = diag(l, 2,3), and 
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As for the previous model setup, we also considered two further scenarios with the addition of 
noise variables in the first case, and noise and redundant variables in the second case. 


Model 3: three overlapping clusters with unconstrained covariances. For this last 
scheme we simulated each data set from three overlapping clusters with equal mixing prob¬ 
abilities on three variables. Clustering variables were generated from a multivariate normal 


distribution with parameters set as in the synthetic example of Section 4.1, which correspond 
to a VVV parametrization with G = 3. Again, we also considered the effect of adding only 
noise variables, and noise and redundant variables. 

For each data set simulated as described above we fitted a GMM with number of clusters 
in the range [1,15] and different model parametrizations, then we selected the model with the 
highest BIC. A GMM was also fitted to the leading principal components computed from the 
correlation matrix, or equivalently using standardized variables; the number of components was 
chosen using the Kaiser’s rule, i.e. selecting those with eigenvalues larger than one (Jolliffe 
p. 114-115). Finally, we applied the GMMDR approach discussed in Section 2.2 fol- 


2002 


lowed by the subset selection procedure of Section For each estimated model we computed 
the adjusted Rand index (Hubert and Arabie, 1985[) as a criterion to evaluate the clustering 


partitions obtained: higher values of ARI correspond to better performance. 

The results of the simulation procedure based on 1000 repetitions for increasing sample sizes 
are shown on Table where we reported the average adjusted Rand index for the three fitting 
procedures to be compared and the corresponding standard errors. 

When no noise variables are included, the GMMDR procedure yields equivalent clustering 
accuracy to GMM on the original variables. However, when irrelevant features (either noise 
or redundant ones) are present the GMMDR procedure significantly improves the accuracy as 
measured by the adjusted Rand index. This is particular evident for smaller sample sizes {n = 
100, 300) when noise variables are present, and when the underlying clusters have different within 
cluster covariance matrices. The PGA-I-GMM procedure leads to uniformly small values of ARI, 
confirming that reducing the dimensionality of the problem through principal components does 
not help to discover clustering structures in the data. As expected, as the sample size increases 
all the three methods improve in accuracy, with the GMM method on the original variables and 
the GMMDR approach leading essentially to the same accuracy for n = 1000. 

A further aspect was also checked, the number of clusters selected by each procedure (results 
not shown here). Recalling that for all the simulation schemes the number of clusters was three, 
GMMDR almost always selects the true value, whereas GMM tends to select a larger number 
of components when noise variables are included and the clusters have different covariance 
structures. PGA-I-GMM performs worst than the other two methods, often under-estimating 
the true number of clusters in the presence of noise variables and for smaller sample sizes 
(n = 100,300). 

Based on the simulation results we may conclude that when there are redundant or noise 
variables a suitable subset of GMMDR variables allows us to significantly improve cluster identi¬ 
fication. The improvement is larger for small samples having different within-cluster covariances 
and in the presence of noise variables. Overall, using GMMDR variables do not produce worse 
results than using original variables, but often leads to a better selection of the clustering struc¬ 
ture. There appears no reason to perform model-based clustering on principal components. 
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Table 2: Average adjusted Rand index (with standard errors within parenthesis) based on 1000 
simulations. 




No noise 

Noise variables 

Noise and redundant variables 

Model 1 (EEE) 

n = 100 

n = 300 

n = 1000 

n = 100 

n = 300 

n = 1000 

n = 100 

n = 300 

n = 1000 

GMM 

0.9684 

0.9744 

0.9754 

0.6869 

0.8918 

0.9742 

0.8518 

0.9695 

0.9743 


(0.0420) 

(0.0166) 

(0.0083) 

(0.1363) 

(0.1217) 

(0.0090) 

(0.1889) 

(0.0241) 

(0.0087) 

PCA+GMM 

0.8591 

0.9105 

0.9146 

0.4989 

0.8352 

0.9091 

0.5361 

0.7518 

0.7942 


(0.1626) 

(0.0446) 

(0.0221) 

(0.1700) 

(0.1256) 

(0.0298) 

(0.1342) 

(0.0909) 

(0.0239) 

GMMDR 

0.9716 

0.9742 

0.9753 

0.8612 

0.9674 

0.9742 

0.8832 

0.9699 

0.9742 


(0.0347) 

(0.0172) 

(0.0088) 

(0.1426) 

(0.0234) 

(0.0090) 

(0.1613) 

(0.0197) 

(0.0088) 

Model 2 (VEV) 

n = 100 

n = 300 

n = 1000 

n = 100 

n = 300 

n = 1000 

n = 100 

n = 300 

n = 1000 

GMM 

0.9738 

0.9802 

0.9819 

0.7316 

0.6328 

0.9808 

0.6648 

0.6811 

0.9807 


(0.0302) 

(0.0141) 

(0.0073) 

(0.0825) 

(0.0793) 

(0.0081) 

(0.0913) 

(0.1572) 

(0.0078) 

PCA+GMM 

0.6562 

0.6789 

0.7130 

0.1907 

0.4307 

0.6458 

0.4579 

0.6465 

0.8018 


(0.2745) 

(0.2138) 

(0.0824) 

(0.1386) 

(0.1700) 

(0.1417) 

(0.1780) 

(0.1483) 

(0.0309) 

GMMDR 

0.9709 

0.9802 

0.9819 

0.9201 

0.9747 

0.9806 

0.9231 

0.9727 

0.9806 


(0.0351) 

(0.0141) 

(0.0073) 

(0.0799) 

(0.0165) 

(0.0082) 

(0.0811) 

(0.0169) 

(0.0077) 

Model 3 (VVV) 

n = 100 

n = 300 

n = 1000 

n = 100 

n = 300 

n = 1000 

n = 100 

n = 300 

n = 1000 

GMM 

0.9960 

0.9981 

0.9983 

0.6937 

0.8877 

0.9751 

0.8434 

0.9708 

0.9742 


(0.0115) 

(0.0043) 

(0.0024) 

(0.1260) 

(0.1189) 

(0.0081) 

(0.1932) 

(0.0194) 

(0.0092) 

PCA+GMM 

0.9805 

0.9896 

0.9899 

0.4917 

0.8357 

0.9074 

0.5396 

0.7493 

0.7945 


(0.0534) 

(0.0105) 

(0.0053) 

(0.1663) 

(0.1277) 

(0.0316) 

(0.1353) 

(0.0984) 

(0.0236) 

GMMDR 

0.9952 

0.9981 

0.9983 

0.8643 

0.9674 

0.9751 

0.8799 

0.9712 

0.9740 


(0.0146) 

(0.0042) 

(0.0024) 

(0.1384) 

(0.0210) 

(0.0081) 

(0.1609) 

(0.0185) 

(0.0099) 


4.2.1 Unequal prior probabilities 

The previous simulation examples assumed equal prior group probabilities. It may be of interest 
to investigate if the results obtained depend on such assumption. We expect that, provided there 
is enough cluster information in the data, no dramatic change from previous findings should 
be observed. Thus, we replicated the previous simulation study for the more complex Model 3 
(VVV), but with unequal prior probabilities set at tt = (0.1, 0.3,0.6). The results are shown in 
Table 0 

When no noise variables are included the accuracy of GMM and GMMDR are essentially 
the same as in the equal prior probabilities case, whereas PCA+GMM performs slightly worse. 
If irrelevant variables (either noise or redundant ones) are present both GMM and GMMDR 
yield somewhat worse results when n = 300 or 1000. This seems to be due to the fact that 
BIG often selected models with more components than the true number of clusters. On the 
contrary, PGA+GMM seems to improve when redundant variables (i.e. variables correlated with 
clustering ones) are included. Since principal components are computed from the correlation 
matrix, we argued that correlated variables are likely to provide useful clustering information 
when some groups are represented by few observations. Overall, we note that GMMDR never 
performed worst than GMM, and it was also better than PCA+GMM except when irrelevant 
variables are included and n = 300. 


4.2.2 High-dimensional data 


Clustering high-dimensional data is a challenging task. The main problem in fitting a GMM 
model is the large number of parameters which need to be estimated. To overcome this Ghosh| 


and Chinnaiyan (2002) suggested to perform model-based clustering on the first few principal 


components. Based on the conclusion of Chang (1983), discussed also in Section]^ this approach 
does not seem to be advisable. Bouveyron et al (2007) proposed a method for dealing with high 
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Table 3: Average adjusted Rand index (with standard errors within parenthesis) based on 1000 
simulations with unequal prior probabilities. 


Model 3 (VVV) 


No noise 

Noise variables 

Noise and redundant variables 


n = 100 

n = 300 

n = 1000 

n = 100 

n = 300 

n = 1000 

n = 100 

n = 300 

n = 1000 

GMM 

0.9826 

0.9991 

0.9994 

0.7451 

0.6195 

0.7407 

0.8214 

0.7326 

0.7209 


(0.0502) 

(0.0025) 

( 0 . 0010 ) 

(0.1468) 

(0.1663) 

(0.1160) 

(0.1433) 

(0.1188) 

(0.0937) 

PCA-^GMM 

0.8781 

0.8644 

0.8606 

0.8305 

0.8347 

0.7398 

0.8334 

0.8434 

0.8228 


(0.0970) 

(0.0871) 

(0.0532) 

(0.0748) 

(0.1050) 

(0.2017) 

(0.0590) 

(0.0353) 

(0.1054) 

GMMDR 

0.9742 

0.9991 

0.9994 

0.8733 

0.8176 

0.7964 

0.8720 

0.8073 

0.8136 


(0.0734) 

(0.0024) 

( 0 . 0011 ) 

(0.1341) 

(0.1378) 

(0.1547) 

(0.1303) 

(0.1601) 

(0.1600) 


dimensional data. 


In Section 4.2 we described a three clusters VEV model (2) on 3 clustering variables. We 


also considered the inclusion of 3 redundant variables (i.e., correlated with the clustering ones) 
and 4 noise variables; there was a total of ten variables according to the scheme {3|3|4}. This 
basic setup has been generalized to investigated the behavior of the proposed approach as the 
number of variables increase. For k = {1,2,3, 5} we generated lOA; variables following the 
scheme {3A:|3A;|4A:}; for instance, if A: = 3, there are 30 variables, 9 of them are clustering 
variables, 9 are correlated with the previous ones, and the remaining 12 are noise variables. We 
only investigated the most difficult case, i.e. the case of a small sample size, by setting n = 100 . 

As the number of variables increase the accuracy of all the three methods tend to decrease, 
with GMMDR always more accurate on the average that the other two methods (see Table [^. 
However, for p = 40 and p = 50 the average ARI for GMMDR dropped just above 0.6. Looking 
in more detail the results, we realized that, even though a subset of directions could provide 
better cluster accuracy, BIG frequently selected too many of such directions. To hnd evidence of 
this fact we replicated the simulation study, but using the simple model Eli (common spherical 
covariance matrix) and selecting the GMMDR directions on the basis of the entropy criterion 


(Celeux and Soromen, 1996). This provides a measure of the overlap of the clusters, and it is 


defined as — ^ig ^og{tig), where tig denotes the conditional probability that the i-th 

unit arises from the ( 7 -th mixture component. Except for the case p = 10, where the ARI is 
smaller, and p = 20, where the ARI is essentially equivalent, the accuracy of GMMDR is largely 
increased (see Table Q. Note also that, although we constrained to fit the wrong Eli model, 
the accuracy of GMM do not worsen. 

Based on this limited study, it seems that for high-dimensional data GMMDR directions 
are still able to recover the clustering structure of the data, but the BIG criterion overestimates 
how many of them are needed. When the entropy criterion is used to select relevant directions 
the clustering accuracy improve. 

An extreme form of high-dimensional data is the p^ n case. Clustering of gene expression 
data arising from microarray experiments is a typical application and model-based approaches 


have been proposed (Yeung et al 2001). In order to apply the approach discussed in this paper, 


a form of regularization is required for the inversion of the covariance matrix in (|^, such as 
one of those discussed in Bernard-Michel et al (2009). However, these aspects require further 
studies. 


5 Data analysis examples 

5.1 Italian wines 


Forina et al (1986) reported data on 178 wines grown in the same region in Italy but derived 


from three different cultivars (Barolo, Grignolino, Barbera). For each wine 13 measurements of 
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Table 4: Average adjusted Rand index (with standard errors within parenthesis) based on 500 
simulations for increasing dimensionality of the feature space and n = 100. The last two rows 
show the results from fitting a GMM with a common spherical and equal volume model (Eli) 
and the corresponding GMMDR with relevant directions selected by the entropy criterion. 


Model 2 (VEV) 

p = 10 

p = 20 

II 

CO 

o 

O 

II 

p = 50 

GMM 

0.6788 

0.6405 

0.6128 

0.5968 

0.5859 


(0.0850) 

(0.0702) 

(0.0744) 

(0.0726) 

(0.0783) 

PCA-pGMM 

0.4602 

0.7238 

0.6928 

0.6485 

0.6009 


(0.1831) 

(0.1655) 

(0.1574) 

(0.1582) 

(0.1283) 

GMMDR 

0.9205 

0.9717 

0.8447 

0.6802 

0.6131 


(0.0843) 

(0.0951) 

(0.1960) 

(0.2092) 

(0.1920) 

GMM (Eli) 

0.7101 

0.6581 

0.6299 

0.6148 

0.5979 

(0.0889) 

(0.0707) 

(0.0696) 

(0.0753) 

(0.0765) 

GMMDR 

0.8447 

0.9830 

0.9817 

0.9204 

0.8381 


(0.1672) 

(0.0837) 

(0.0801) 

(0.1528) 

(0.1923) 


chemical and physical properties were made, such as the level of alcohol, the level of magnesium, 
the color intensity, etc. The data set is avalaible at UCI Machine learning data repository 
http://archive.ics.uci.edu/nil/datasets/Wine. The following analyses were performed on 
a standardized scale. 

Modelling all the available variables with a GMM selects the model VEI with eight clusters, 
which gives a small adjusted Rand ind ex (see Table [Sj). T his can be largely improved by the 
variable selection method proposed by Raftery and Dean (2006). A comparable accuracy is 
also achieved by McNicholas and Murphy (2008), who applied a class of parsimonious Gaussian 
mixture models (PGMM) based on mixtures of factor analyzers; their selected model has q = 2 
latent factors and it is denoted as CUU (see McNicholas and Murphy, 2008, Table 1). We note, 
however, that, despite the improvement on accuracy, PGMM selects a wrong number of clusters 
(see Table [^. 


Table 5: Model-based clustering results for some models htted to the wine data. 


Method 

Number of 
features 

Model 

G 

ARI 

GMM 

13 

VEI 

8 

0.48 

GMM (best subset'1') 

5 

VEV 

3 

0.78 

PGMM 

2 

CUU 

4 

0.79 

GMMDR 

5 

VEV 

3 

0.85 


1(Malic, Proline, Flavainoids, Intensity, 0D280) 


We applied the GMMDR approach to this data set and we obtained a final VEV model 
with three clusters estimated on five GMMDR directions. This model yields an adjusted Rand 
index equal to 0.85 (see Table and the resulting confusion matrix is reported in Table 

As Eigure shows, the first two directions mainly exhibit differences in cluster means, 
whereas the remaining directions mostly represent differences in cluster variances. We also note 
that the first three GMMDR directions appear to contain most of the clustering information. 

Figure shows a static view of a rotating 3D spin plot for the first three GMMDR directions 
with data points marked according to cluster membership; Barbera wines (3) appear to be well 
separated from the others along the first direction, while Barolo (1) and Grignolino (2) wines 
present a significant separation on the plane identified by the second and third directions. 
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Wines 

Gluster 

1 2 3 

Barolo 

57 

2 

0 

Grignolino 

2 

64 

5 

Barbera 

0 

0 

48 


Table 6: A classification table for the best model found using GMMDR on the wine data. 



Figure 5: Plot of eigenvalues for each GMMDR direction with contributions from means and 
variances differences among clusters for the wine data. 


5.2 Australian crabs data 

This data set contains measurements recorded on 200 specimens of Leptograpsus variegatus 
crabs on the shore in Western Australia (Gampbell and Mahon, 1974). Grabs are classified 


according to their colour form (blue and orange) and sex. There are 50 specimens of each sex 
of each species. Each specimen has 5 measurements: frontal lobe size (FL), rear width (RW), 
carapace length (GL) and width (GW), body depth (BD). The main interest is to distinguish 
between species and sex based on the available morphological measurements. 

Gaussian mixture modelling performs poorly for this dataset. The model with the highest 
BIG selects 9 components, which gives a small adjusted Rand index. If we constrain the number 
of components to match the actual number of classes, the performance is still poor (see Table [^. 
Raftery and Dean (2006) obtained a large improvement through variable selection, ending up 
with a mixture model based on four variables. This dataset was also analyzed by |Bouveyron et"^ 
( |2007 pp. 516-517) who obtained a 5% error rate with the high-dimensional data clustering 


(HDDG) model [aibiQidi]. 

Starting with the “best” unconstrained mixture model (EEE, G = 9) we estimated the 
GMMDR directions and then we performed feature selection. The final GMMDR model is 
fitted on three directions with 4 clusters having ellipsoidal, equal volume and shape but different 


orientation (EEV). The error rate is 7.50%, the same obtained by Raftery and Dean (2006). 
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Figure 6: A static view of a 3D spinning plot of the first three GMMDR directions for the wine 
data. Data points are marked according to cluster membership; the clusters identified refer 
mainly to (1) Barolo, (2) Grignolino, and (3) Barbera. 


The resulting accuracy is slightly worst than that for the HDDC model of|Bouveyron et al 


(2007), but it has the advantage of providing a visualization of the clustering structure. 


Table 7: Model-based clustering results for some models fitted to the crabs data. 


Method 

Model 

G 

Error 
rate (%) 

ARI 

GMM (unconstrained) 

FEE 

9 


0.4831 

GMM (constrained) 

VEV 

4 

42.50 

0.3165 

GMM (best subset^^) 

EEV 

4 

7.50 

0.8154 

HDDC 

\Q>i hiC^id>j\ 

4 

5.00 


GMMDR 

EEV 

4 

7.50 

0.8195 


t(CW, RW, FL, BD) 


Summary plots (not shown here) indicate that the first direction mainly separates blue 
from orange type of crabs, while the second reflects sex differences. The last direction adds 
only marginal information contrasting blue and orange crabs within sex. Figure plots the 
uncertainty Ui = 1 — maxg(zjg) projected onto the first two GMMDR directions, where %g is 
the estimated conditional probability that observation i belongs to group g. Misclassified crabs 
are indicated by a square surrounding the point. As it can be seen, orange and blue crabs 
appear well separated, whereas discrimination of sex within crabs species is more problematic: 
some points lie in the regions of higher uncertainty (darker areas), in particular for blue crabs. 
R is interesting to note that these aspects can also be read-off from the corresponding confusion 
matrix (see Table [^. 
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Species 

1 

Cluster 

2 3 

4 

BE 

50 

0 

0 

0 

BM 

12 

38 

0 

0 

OF 

0 

0 

47 

3 

OM 

0 

0 

0 

50 


Table 8: A classification table for the final GMMDR model fitted on the crabs data. 

6 Conclusion 

In this paper we addressed the problem of dimension reduction for model-based clustering from 
the point of view of visualizing the clustering structure. Often, principal component analysis 
is used as a pre-processing step, and a clustering procedure is applied to the leading princi¬ 
pal components. Probabilistic principal components and factor analyzers have been also used 
for dimension reduction. However, these do not necessarily capture the underlying clustering 
structure. 


Unlike other approaches ( 

Tipping and Bishop 

ToMalb; 

McLachlan et al 

2003 

Bouveyron 

et al 

2007; 

McNicholas and Murphy 

2008 

), the proposed method is not connected to any un- 


derlying latent variables structure, nor an EM-like algorithm is employed in the estimation. 
Instead, we aim at identifying the subspace which capture most of the clustering information 
contained in the data. Estimation of directions spanning such subspace is based on the eigen- 
decomposition of a suitable kernel matrix, with unknown parameters estimated from a GMM 
selected as the best description of the data at hand. Visualization techniques can then be ap¬ 
plied using the leading directions in order to obtain summary plots. The estimated directions 
can be further reduced by recasting the problem of feature selection as a model choice problem 
and using BIG to approximate Bayes factors. A forward greedy search algorithm is discussed 
to avoid the search over all possible subsets. 

The proposed methodology has been applied to simulated and real data sets. In all cases 
we were able to identify a dimension reduced subspace while preserving the cluster information 
contained in the original variables. Moreover, simulation studies showed that a suitable subset 
of directions allows for significant improvement in cluster identification. The improvement is 
larger when within-cluster covariances are different and redundant or noise variables are present. 

Finally, it is straightforward to extend the proposed approach in a supervised context, where 
the class membership for each unit is known in advance. 


Appendix 

A.l Proofs 

Proof of proposition Define the transformed data matrix as X* = XC + la^, where 
C is a, {p X p) nonsingular matrix, a is a p-vector of real values, and 1 is a n-vector of ones. 
It is easy to show that in the transformed scale 51* = and the kernel matrix is given 

by M* = MC. The generalized eigendecompostion in the transformed scale satisfy the 
equation MCV* = 'SCV*L*, so CV* must be equal to the matrix V of eigenvalues of the 
generalized eigendecomposition of M with respect to 51, and L* is equal to the corresponding 
diagonal matrix of eigenvalues. Hence, V* = C~^V, L* = L, and the GMMDR directions 
in the transformed scale are given by (3* = C~^l3. Furthermore, the corresponding variates 
Z* = X*(3* = X(3 + la^C~^f3, so they are simply a shifted version of the GMMDR variates 
in the original scale. 
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Figure 7: Plot of crabs data points with uncertainty areas projected onto the first two GMMDR 
directions. Data points are marked according to the estimated cluster membership, while a 
square surrounding a point indicates a misclassified crab. 


Proof of proposition For any kernel matrix we may rewrite the eigendecomposition in Q 
as MV = YtVL. Since by definition V~^J1V = Id, the diagonal matrix of eigenvalues can be 
expressed as L = diag[/j] = V~^MV. 

We start by considering the kernel matrix Mj, for which Mj = Var(E(X|y)). Thus, 

L = ■p^Var(E(X|y))P = Var(E(Z|y)) = diag[Var(E(yi|y)], 

where Z = XV is the projection onto the subspace spanned by the columns of V. So, for a 
SIR kernel matrix each eigenvalue is equal to the variance of the between group means along the 
i-th direction. Consider the kernel matrix in Q, for which the diagonal matrix of eigenvalues 
can be written as 

L = V^Mi'E-^MiV + V^MnV. 

We note that since V~^'SV = Id we have = VV~^. Thus, the first part of the above 
equation may be written as 

V^MiS-^MiV = {V^MiV){V^MiV) = Var(E(Z|y))2. 


Eor the following part we have 


V^MiiV = V'^ 


G 


5=1 





V = E(Var(Z|y)2). 


Therefore, the matrix of eigenvalues can expressed as L = Var(E(.Z|y))^ + E(Var(.Z|y)^), and 
proposition follows. 
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A.2 Greedy search algorithm for feature selection 


In this section we provide a detailed description of the greedy search algorithm used to select 
the “best” subset of GMMDR variates. 


The BIG criterion ( 

Schwartz 

1978 

) is the basic building block for choosing the number of 

clusters and model parametrization (Fraley and Raftery 

1998 

). In general, for a finite mixture 


model M with g components it is computed as 


g) = 2 max I — i'M,g 


where maxi is the maximized log-likelihood, I'M.g is the number of parameters estimated, 
and n is the number of observations. The difference between BIG values for two models is 
approximately equal to twice the logarithm of the Bayes factor for the comparison of the two 


models (Kass and Raftery, 1995). Raftery and Dean (2006) investigated variable selection in 


model-based clustering, recasting the problem as a model selection procedure. We adapt their 
approach for GMMDR variable selection. 

Let =5^ = {1, 2,..., d} be the set of GMMDR variables Zy computed for a mixture model 
following the method described in Section 2.2 The criterion used for model comparison is the 


BIG difference in equation (Q. The steps of the greedy search algorithm are the following: 


1. Select the first variable to be the one which maximizes the BIG difference criterion. Let 
5^\ = {i} be the candidate set, and 5^[ = 0 be the set of already included variables, which 
is of course empty at the beginning. We select the variable Zi^ such that 

ii = argjg^max BICdm{Zy^) 

— ^^Si£y (.SLGciust) BlC^^gi^Zi^'). 

At the first step dim(=5^i) = 1 , thus maximization is taken over the clustering models E 
and V with G = 1 , 2 ,..., max(G), where max(G) is the maximum number of clusters to 
be considered for the data. Then, define = {R}, set j = 2 and go to the next step. 

2 . Select a variable to include, among those not already included, to be the one which 

maximizes the BIG difference criterion. Formally, let U {i} be the candidate 

set, and 5^'^ = be the set of already included variables. We choose the variable Zi. 

to be included such that 

D' = argj 6 j^\j?y max BICdxs{Zy^) 

In this case the “best” model is identified with respect to both the number of mixture 
components up to max(G), and model parametrization. Then, update the subset of 
currently included variables, i.e. U {ij}- 

3. Set i = J -|- 1 and iterate the previous step until a stopping rule is met. The algorithm 
naturally terminates when all variables are included, but it might be stopped earlier when, 
for example, the maximal BIG difference for the inclusion of a variable becomes negative. 


As mentioned, the proposed greedy search is a forward-type algorithm. Given the orthogo¬ 
nality of the GMMDR variables, a backward step is not required. 
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